library(car)
library(ggcorrplot)
library(tidyverse)
library(stringr)
library(ggpubr)
library(knitr)
library(MASS)
library(dplyr)
library(tidyr)
library(ggplot2)
library(lubridate)
library(splines)
library(mgcv)
library(reshape2)
library(gridExtra)
library(lmPerm)
library(caret)
Data <- read_csv(file.path(PSDS_PATH, 'dataset.csv'))
rm(dir1,dir2,dir3,dir4,dir5,dir6,dir7,dir8,dir9,dir10,file_name,PSDS_PATH)
str(Data)
## spc_tbl_ [9,134 × 24] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Customer : chr [1:9134] "SP56326" "VQ57815" "JG36165" "LO75919" ...
## $ State : chr [1:9134] "Arizona" "Oregon" "California" "Arizona" ...
## $ Customer Lifetime Value : num [1:9134] 7453 8828 6043 6008 15245 ...
## $ Response : chr [1:9134] "No" "No" "No" "No" ...
## $ Coverage : chr [1:9134] "Premium" "Basic" "Extended" "Extended" ...
## $ Education : chr [1:9134] "High School or Below" "Doctor" "Bachelor" "Bachelor" ...
## $ Effective To Date : chr [1:9134] "01/01/11" "01/01/11" "01/01/11" "01/01/11" ...
## $ EmploymentStatus : chr [1:9134] "Medical Leave" "Employed" "Unemployed" "Employed" ...
## $ Gender : chr [1:9134] "F" "M" "M" "M" ...
## $ Income : num [1:9134] 12195 61246 0 43543 30205 ...
## $ Location Code : chr [1:9134] "Suburban" "Rural" "Suburban" "Urban" ...
## $ Marital Status : chr [1:9134] "Divorced" "Married" "Single" "Married" ...
## $ Monthly Premium Auto : num [1:9134] 101 110 83 76 195 74 66 112 74 66 ...
## $ Months Since Last Claim : num [1:9134] 29 24 12 11 24 21 8 6 3 15 ...
## $ Months Since Policy Inception: num [1:9134] 35 21 21 62 1 30 93 71 42 45 ...
## $ Number of Open Complaints : num [1:9134] 0 0 0 0 0 0 0 0 0 0 ...
## $ Number of Policies : num [1:9134] 7 3 5 8 4 2 1 9 3 1 ...
## $ Policy Type : chr [1:9134] "Personal Auto" "Personal Auto" "Personal Auto" "Personal Auto" ...
## $ Policy : chr [1:9134] "Personal L3" "Personal L2" "Personal L1" "Personal L3" ...
## $ Renew Offer Type : chr [1:9134] "Offer1" "Offer2" "Offer4" "Offer1" ...
## $ Sales Channel : chr [1:9134] "Branch" "Branch" "Web" "Call Center" ...
## $ Total Claim Amount : num [1:9134] 671 159 904 339 1330 ...
## $ Vehicle Class : chr [1:9134] "Four-Door Car" "Sports Car" "Two-Door Car" "Two-Door Car" ...
## $ Vehicle Size : chr [1:9134] "Medsize" "Medsize" "Medsize" "Medsize" ...
## - attr(*, "spec")=
## .. cols(
## .. Customer = col_character(),
## .. State = col_character(),
## .. `Customer Lifetime Value` = col_double(),
## .. Response = col_character(),
## .. Coverage = col_character(),
## .. Education = col_character(),
## .. `Effective To Date` = col_character(),
## .. EmploymentStatus = col_character(),
## .. Gender = col_character(),
## .. Income = col_double(),
## .. `Location Code` = col_character(),
## .. `Marital Status` = col_character(),
## .. `Monthly Premium Auto` = col_double(),
## .. `Months Since Last Claim` = col_double(),
## .. `Months Since Policy Inception` = col_double(),
## .. `Number of Open Complaints` = col_double(),
## .. `Number of Policies` = col_double(),
## .. `Policy Type` = col_character(),
## .. Policy = col_character(),
## .. `Renew Offer Type` = col_character(),
## .. `Sales Channel` = col_character(),
## .. `Total Claim Amount` = col_double(),
## .. `Vehicle Class` = col_character(),
## .. `Vehicle Size` = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
duplicates <- duplicated(Data$Customer)
num_true <- sum(duplicates)
print(num_true)
## [1] 0
remove(duplicates,num_true)
We can conclude that there are no duplicates.
Data <- Data %>%
rename(Customer_Lifetime_Value=`Customer Lifetime Value`,
Effective_To_Date=`Effective To Date`,
Location_Code=`Location Code`,
Employment_Status=EmploymentStatus,
Marital_Status=`Marital Status`,
Monthly_Premium_Auto=`Monthly Premium Auto`,
Months_Since_Last_Claim=`Months Since Last Claim`,
Months_Since_Policy_Inception=`Months Since Policy Inception`,
Number_of_Open_Complaints=`Number of Open Complaints`,
Number_of_Policies=`Number of Policies`,
Policy_Type=`Policy Type`,
Renew_Offer_Type=`Renew Offer Type`,
Sales_Channel=`Sales Channel`,
Total_Claim_Amount=`Total Claim Amount`,
Vehicle_Class=`Vehicle Class`,
Vehicle_Size=`Vehicle Size`)
Data <- Data %>%
mutate(`Effective_To_Date`=mdy(`Effective_To_Date`))
Data$Effective_Days <- weekdays(Data$`Effective_To_Date`)
Data$Effective_Days <- factor(Data$Effective_Days, c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))
Analyze each predictor variable individually to understand its distribution and identify any potential outliers or data issues.
Histograms: Create histograms to visualize the distribution of each continuous predictor variable.
Box plots: Use box plots to identify the range, quartiles, and potential outliers of continuous predictor variables.
Bar plots: Create bar plots for categorical predictor variables to visualize the frequency of each category.
Summary statistics: Calculate summary statistics, such as mean, median, standard deviation, and quartiles, to describe the central tendency and dispersion of each continuous predictor variable.
Analyze the relationship between each predictor variable and the target variable.
Scatter plots: Plot scatter plots between continuous predictor variables and the target variable to visualize the relationships and identify any trends or patterns.
Box plots or violin plots: Use box plots or violin plots to visualize the distribution of the target variable across different categories of a categorical predictor variable.
Correlation matrix: Calculate the correlation matrix for continuous predictor variables to identify any strong linear relationships with the target variable.
CLV = Customer Lifetime Value
INC = Income
MPA = Monthly Premium Auto
TCA = Total Claim Amount
As CLV has a high correlation with MPA, We decided to study how other categorical values affect MPA.
In this section, we will perform a concise statistical analysis to select the most relevant predictors for our regression model. The steps involved in this process are as follows:
Multicollinearity assessment: Check for correlations among predictor variables to ensure stability and reliability in the regression model.
Bivariate analysis: Investigate associations between each predictor and the target variable using correlation coefficients and hypothesis tests.
Feature selection: Employ techniques like stepwise regression, LASSO, or Ridge regression to systematically choose the optimal subset of predictors.
Visualizations: Leverage graphical representations to effectively communicate findings and support informed decision-making.
By following these steps, we will identify a suitable set of predictor variables that contribute meaningfully to our regression model’s performance.
## [1] "Settings: unique SS "
## Component 1 :
## Df R Sum Sq R Mean Sq Iter Pr(Prob)
## State1 4 5.1550e+07 12887444 1533 0.9289
## Residuals 9129 4.3112e+11 47225235
## [1] "Settings: unique SS "
## Component 1 :
## Df R Sum Sq R Mean Sq Iter Pr(Prob)
## Response1 1 3.4380e+07 34380462 184 0.3533
## Residuals 9132 4.3114e+11 47211601
## [1] "Settings: unique SS "
## Component 1 :
## Df R Sum Sq R Mean Sq Iter Pr(Prob)
## Location_Code1 2 1.0200e+07 5099770 51 1
## Residuals 9131 4.3116e+11 47219419
## [1] "Settings: unique SS "
## Component 1 :
## Df R Sum Sq R Mean Sq Iter Pr(Prob)
## Marital_Status1 2 3.1310e+08 156547994 5000 < 2.2e-16 ***
## Residuals 9131 4.3086e+11 47186247
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Call:
## lm(formula = Customer_Lifetime_Value ~ Marital_Status, data = Data)
##
## Coefficients:
## (Intercept) Marital_StatusMarried Marital_StatusSingle
## 8241.2 -162.3 -526.4
##
## Call:
## lm(formula = Customer_Lifetime_Value ~ Marital_Status, data = Data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6237 -3990 -2219 956 75246
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8241.2 185.7 44.390 <2e-16 ***
## Marital_StatusMarried -162.3 208.3 -0.779 0.436
## Marital_StatusSingle -526.4 231.5 -2.274 0.023 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6869 on 9131 degrees of freedom
## Multiple R-squared: 0.0007262, Adjusted R-squared: 0.0005073
## F-statistic: 3.318 on 2 and 9131 DF, p-value: 0.03628
## [1] "Settings: unique SS "
## Component 1 :
## Df R Sum Sq R Mean Sq Iter Pr(Prob)
## Coverage1 2 1.2265e+10 6132650242 5000 < 2.2e-16 ***
## Residuals 9131 4.1891e+11 45877277
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Call:
## lm(formula = Customer_Lifetime_Value ~ Coverage, data = Data)
##
## Coefficients:
## (Intercept) CoverageExtended CoveragePremium
## 7191 1599 3705
##
## Call:
## lm(formula = Customer_Lifetime_Value ~ Coverage, data = Data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7723 -4300 -1971 1185 74536
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7190.71 90.77 79.22 <2e-16 ***
## CoverageExtended 1598.97 158.02 10.12 <2e-16 ***
## CoveragePremium 3704.90 252.82 14.65 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6773 on 9131 degrees of freedom
## Multiple R-squared: 0.02845, Adjusted R-squared: 0.02823
## F-statistic: 133.7 on 2 and 9131 DF, p-value: < 2.2e-16
## [1] "Settings: unique SS "
## Component 1 :
## Df R Sum Sq R Mean Sq Iter Pr(Prob)
## Education1 4 4.5725e+08 114312631 5000 < 2.2e-16 ***
## Residuals 9129 4.3071e+11 47180794
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Call:
## lm(formula = Customer_Lifetime_Value ~ Education, data = Data)
##
## Coefficients:
## (Intercept) EducationCollege
## 7872.7 -21.6
## EducationDoctor EducationHigh School or Below
## -352.3 424.0
## EducationMaster
## 370.8
##
## Call:
## lm(formula = Customer_Lifetime_Value ~ Education, data = Data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6356 -3992 -2187 879 75029
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7872.7 131.0 60.082 <2e-16 ***
## EducationCollege -21.6 186.5 -0.116 0.9078
## EducationDoctor -352.3 393.9 -0.895 0.3711
## EducationHigh School or Below 424.1 187.5 2.261 0.0238 *
## EducationMaster 370.8 284.3 1.304 0.1922
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6869 on 9129 degrees of freedom
## Multiple R-squared: 0.00106, Adjusted R-squared: 0.0006228
## F-statistic: 2.423 on 4 and 9129 DF, p-value: 0.04604
## [1] "Settings: unique SS "
## Component 1 :
## Df R Sum Sq R Mean Sq Iter Pr(Prob)
## Employment_Status1 4 7.1856e+08 179640773 5000 < 2.2e-16 ***
## Residuals 9129 4.3045e+11 47152169
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Call:
## lm(formula = Customer_Lifetime_Value ~ Employment_Status, data = Data)
##
## Coefficients:
## (Intercept) Employment_StatusEmployed
## 7847.9 371.2
## Employment_StatusMedical Leave Employment_StatusRetired
## -206.1 -360.0
## Employment_StatusUnemployed
## -211.6
##
## Call:
## lm(formula = Customer_Lifetime_Value ~ Employment_Status, data = Data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5964 -3952 -2285 895 75106
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7847.9 341.2 23.000 <2e-16 ***
## Employment_StatusEmployed 371.2 353.1 1.051 0.293
## Employment_StatusMedical Leave -206.1 474.9 -0.434 0.664
## Employment_StatusRetired -360.0 532.6 -0.676 0.499
## Employment_StatusUnemployed -211.6 369.8 -0.572 0.567
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6867 on 9129 degrees of freedom
## Multiple R-squared: 0.001667, Adjusted R-squared: 0.001229
## F-statistic: 3.81 on 4 and 9129 DF, p-value: 0.004251
## [1] "Settings: unique SS "
## Component 1 :
## Df R Sum Sq R Mean Sq Iter Pr(Prob)
## Gender1 1 7.9863e+07 79863478 5000 0.0036 **
## Residuals 9132 4.3109e+11 47206620
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Call:
## lm(formula = Customer_Lifetime_Value ~ Gender, data = Data)
##
## Coefficients:
## (Intercept) GenderM
## 8096.6 -187.1
##
## Call:
## lm(formula = Customer_Lifetime_Value ~ Gender, data = Data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6198 -4001 -2218 940 75416
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8096.6 100.7 80.427 <2e-16 ***
## GenderM -187.1 143.8 -1.301 0.193
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6871 on 9132 degrees of freedom
## Multiple R-squared: 0.0001852, Adjusted R-squared: 7.574e-05
## F-statistic: 1.692 on 1 and 9132 DF, p-value: 0.1934
## [1] "Settings: unique SS "
## Component 1 :
## Df R Sum Sq R Mean Sq Iter Pr(Prob)
## Policy_Type1 2 2.0613e+08 103063571 5000 0.1696
## Residuals 9131 4.3096e+11 47197962
## [1] "Settings: unique SS "
## Component 1 :
## Df R Sum Sq R Mean Sq Iter Pr(Prob)
## Policy1 8 4.4710e+08 55887813 5000 0.5334
## Residuals 9125 4.3072e+11 47202588
## [1] "Settings: unique SS "
## Component 1 :
## Df R Sum Sq R Mean Sq Iter Pr(Prob)
## Vehicle_Class1 5 5.5043e+10 1.1009e+10 5000 < 2.2e-16 ***
## Residuals 9128 3.7613e+11 4.1206e+07
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Call:
## lm(formula = Customer_Lifetime_Value ~ Vehicle_Class, data = Data)
##
## Coefficients:
## (Intercept) Vehicle_ClassLuxury Car
## 6631.7 10421.6
## Vehicle_ClassLuxury SUV Vehicle_ClassSports Car
## 10491.3 4119.3
## Vehicle_ClassSUV Vehicle_ClassTwo-Door Car
## 3811.8 39.3
##
## Call:
## lm(formula = Customer_Lifetime_Value ~ Vehicle_Class, data = Data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11167 -3769 -1511 1075 66272
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6631.73 94.43 70.229 <2e-16 ***
## Vehicle_ClassLuxury Car 10421.62 511.58 20.371 <2e-16 ***
## Vehicle_ClassLuxury SUV 10491.27 482.56 21.741 <2e-16 ***
## Vehicle_ClassSports Car 4119.26 306.68 13.432 <2e-16 ***
## Vehicle_ClassSUV 3811.79 178.49 21.355 <2e-16 ***
## Vehicle_ClassTwo-Door Car 39.30 175.40 0.224 0.823
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6419 on 9128 degrees of freedom
## Multiple R-squared: 0.1277, Adjusted R-squared: 0.1272
## F-statistic: 267.2 on 5 and 9128 DF, p-value: < 2.2e-16
## [1] "Settings: unique SS "
## Component 1 :
## Df R Sum Sq R Mean Sq Iter Pr(Prob)
## Vehicle_Size1 2 2.2489e+08 112444031 5000 0.086 .
## Residuals 9131 4.3095e+11 47195907
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Call:
## lm(formula = Customer_Lifetime_Value ~ Vehicle_Size, data = Data)
##
## Coefficients:
## (Intercept) Vehicle_SizeMedsize Vehicle_SizeSmall
## 7545.0 505.7 540.1
##
## Call:
## lm(formula = Customer_Lifetime_Value ~ Vehicle_Size, data = Data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6186 -3977 -2212 981 75240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7545.0 223.4 33.779 <2e-16 ***
## Vehicle_SizeMedsize 505.7 239.2 2.114 0.0346 *
## Vehicle_SizeSmall 540.1 276.8 1.951 0.0511 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6870 on 9131 degrees of freedom
## Multiple R-squared: 0.0005216, Adjusted R-squared: 0.0003027
## F-statistic: 2.382 on 2 and 9131 DF, p-value: 0.09238
## [1] "Settings: unique SS "
## Component 1 :
## Df R Sum Sq R Mean Sq Iter Pr(Prob)
## Renew_Offer_Type1 3 3.6291e+09 1209695120 5000 < 2.2e-16 ***
## Residuals 9130 4.2754e+11 46828218
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Call:
## lm(formula = Customer_Lifetime_Value ~ Renew_Offer_Type, data = Data)
##
## Coefficients:
## (Intercept) Renew_Offer_TypeOffer2 Renew_Offer_TypeOffer3
## 8707.1 -1310.3 -709.2
## Renew_Offer_TypeOffer4
## -1527.1
##
## Call:
## lm(formula = Customer_Lifetime_Value ~ Renew_Offer_Type, data = Data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6809 -4046 -2028 1055 74618
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8707.1 111.7 77.938 < 2e-16 ***
## Renew_Offer_TypeOffer2 -1310.3 168.8 -7.764 9.13e-15 ***
## Renew_Offer_TypeOffer3 -709.2 212.6 -3.336 0.000852 ***
## Renew_Offer_TypeOffer4 -1527.1 241.3 -6.330 2.57e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6843 on 9130 degrees of freedom
## Multiple R-squared: 0.008417, Adjusted R-squared: 0.008091
## F-statistic: 25.83 on 3 and 9130 DF, p-value: < 2.2e-16
## [1] "Settings: unique SS "
## Component 1 :
## Df R Sum Sq R Mean Sq Iter Pr(Prob)
## Sales_Channel1 3 1.2472e+08 41572350 1959 0.6131
## Residuals 9130 4.3105e+11 47212048
##
## Call:
## lm(formula = Customer_Lifetime_Value ~ Sales_Channel, data = Data)
##
## Coefficients:
## (Intercept) Sales_ChannelBranch Sales_ChannelCall Center
## 7957.7 162.0 142.4
## Sales_ChannelWeb
## -177.9
##
## Call:
## lm(formula = Customer_Lifetime_Value ~ Sales_Channel, data = Data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6202 -4001 -2209 931 75225
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7957.7 116.5 68.291 <2e-16 ***
## Sales_ChannelBranch 162.0 178.8 0.906 0.365
## Sales_ChannelCall Center 142.4 200.8 0.709 0.478
## Sales_ChannelWeb -177.9 221.8 -0.802 0.423
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6871 on 9130 degrees of freedom
## Multiple R-squared: 0.0002893, Adjusted R-squared: -3.924e-05
## F-statistic: 0.8805 on 3 and 9130 DF, p-value: 0.4503
## [1] "Settings: unique SS "
## Component 1 :
## Df R Sum Sq R Mean Sq Iter Pr(Prob)
## Effective_Days 6 1.4290e+08 23816121 5000 0.9998
## Residuals 9127 4.3103e+11 47225575
## [1] "Settings: unique SS : numeric variables centered"
## Component 1 :
## Df R Sum Sq R Mean Sq Iter Pr(Prob)
## Number_of_Open_Complaints 1 5.695e+08 569502273 5000 < 2.2e-16 ***
## Residuals 9132 4.306e+11 47153002
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Call:
## lm(formula = Customer_Lifetime_Value ~ Number_of_Open_Complaints,
## data = Data)
##
## Coefficients:
## (Intercept) Number_of_Open_Complaints
## 8110.4 -274.3
##
## Call:
## lm(formula = Customer_Lifetime_Value ~ Number_of_Open_Complaints,
## data = Data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6116 -4019 -2215 971 75215
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8110.38 77.99 103.990 < 2e-16 ***
## Number_of_Open_Complaints -274.29 78.93 -3.475 0.000513 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6867 on 9132 degrees of freedom
## Multiple R-squared: 0.001321, Adjusted R-squared: 0.001211
## F-statistic: 12.08 on 1 and 9132 DF, p-value: 0.0005126
##
## Call:
## lm(formula = Customer_Lifetime_Value ~ Months_Since_Policy_Inception,
## data = Data)
##
## Coefficients:
## (Intercept) Months_Since_Policy_Inception
## 7893.480 2.319
##
## Call:
## lm(formula = Customer_Lifetime_Value ~ Months_Since_Policy_Inception,
## data = Data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6206 -4007 -2228 952 75260
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7893.480 143.192 55.12 <2e-16 ***
## Months_Since_Policy_Inception 2.319 2.576 0.90 0.368
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6871 on 9132 degrees of freedom
## Multiple R-squared: 8.871e-05, Adjusted R-squared: -2.079e-05
## F-statistic: 0.8101 on 1 and 9132 DF, p-value: 0.3681
##
## Call:
## lm(formula = Customer_Lifetime_Value ~ Income, data = Data)
##
## Coefficients:
## (Intercept) Income
## 7.797e+03 5.511e-03
##
## Call:
## lm(formula = Customer_Lifetime_Value ~ Income, data = Data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5899 -3976 -2274 917 75203
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.797e+03 1.145e+02 68.114 <2e-16 ***
## Income 5.511e-03 2.366e-03 2.329 0.0199 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6869 on 9132 degrees of freedom
## Multiple R-squared: 0.0005937, Adjusted R-squared: 0.0004842
## F-statistic: 5.425 on 1 and 9132 DF, p-value: 0.01987
##
## Call:
## lm(formula = Customer_Lifetime_Value ~ Months_Since_Last_Claim,
## data = Data)
##
## Coefficients:
## (Intercept) Months_Since_Last_Claim
## 7886.345 7.856
##
## Call:
## lm(formula = Customer_Lifetime_Value ~ Months_Since_Last_Claim,
## data = Data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6155 -3995 -2224 963 75196
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7886.345 129.534 60.882 <2e-16 ***
## Months_Since_Last_Claim 7.856 7.137 1.101 0.271
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6871 on 9132 degrees of freedom
## Multiple R-squared: 0.0001326, Adjusted R-squared: 2.314e-05
## F-statistic: 1.211 on 1 and 9132 DF, p-value: 0.2711
##
## Call:
## lm(formula = Customer_Lifetime_Value ~ Number_of_Policies, data = Data)
##
## Coefficients:
## (Intercept) Number_of_Policies
## 7817.73 63.11
##
## Call:
## lm(formula = Customer_Lifetime_Value ~ Number_of_Policies, data = Data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5983 -3918 -2337 831 75381
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7817.73 114.56 68.240 <2e-16 ***
## Number_of_Policies 63.11 30.07 2.099 0.0359 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6870 on 9132 degrees of freedom
## Multiple R-squared: 0.000482, Adjusted R-squared: 0.0003726
## F-statistic: 4.404 on 1 and 9132 DF, p-value: 0.03588
##
## Call:
## lm(formula = Customer_Lifetime_Value ~ Total_Claim_Amount, data = Data)
##
## Coefficients:
## (Intercept) Total_Claim_Amount
## 5679.932 5.356
##
## Call:
## lm(formula = Customer_Lifetime_Value ~ Total_Claim_Amount, data = Data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12595 -4042 -1804 1128 71707
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5679.9318 125.9190 45.11 <2e-16 ***
## Total_Claim_Amount 5.3561 0.2411 22.22 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6693 on 9132 degrees of freedom
## Multiple R-squared: 0.05128, Adjusted R-squared: 0.05118
## F-statistic: 493.6 on 1 and 9132 DF, p-value: < 2.2e-16
#Split the data into training and holdout sets using a 90-10 split
set.seed(123)
trainIndex <- createDataPartition(Data$Customer_Lifetime_Value, p=0.8, list = FALSE)
trainData <- Data[trainIndex, ]
holdoutData <- Data[-trainIndex, ]
##
## Call:
## lm(formula = Customer_Lifetime_Value ~ Coverage + Vehicle_Class +
## Monthly_Premium_Auto + Total_Claim_Amount, data = trainData,
## na.action = na.omit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11522 -3422 -946 570 56555
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1957.8049 728.5766 2.687 0.00722 **
## CoverageExtended 312.8582 277.1374 1.129 0.25898
## CoveragePremium 898.0097 586.5817 1.531 0.12583
## Vehicle_ClassLuxury Car 2185.6535 1530.9625 1.428 0.15344
## Vehicle_ClassLuxury SUV 2700.4376 1525.6442 1.770 0.07676 .
## Vehicle_ClassSports Car 1304.6222 581.2719 2.244 0.02483 *
## Vehicle_ClassSUV 1276.7265 505.1394 2.527 0.01151 *
## Vehicle_ClassTwo-Door Car 153.8725 190.8422 0.806 0.42011
## Monthly_Premium_Auto 62.3692 10.9021 5.721 1.1e-08 ***
## Total_Claim_Amount -0.9269 0.3264 -2.840 0.00453 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6272 on 7300 degrees of freedom
## Multiple R-squared: 0.1554, Adjusted R-squared: 0.1544
## F-statistic: 149.3 on 9 and 7300 DF, p-value: < 2.2e-16
## GVIF Df GVIF^(1/(2*Df))
## Coverage 5.923714 2 1.560086
## Vehicle_Class 20.427826 5 1.352142
## Monthly_Premium_Auto 26.347164 1 5.132949
## Total_Claim_Amount 1.674327 1 1.293958
The Monthly Premium Auto has a VIF adjusted higher than 5 that indicates some multicollienarity. To adress that We will use the Stepwise regression to select the appropiate features.
## Start: AIC=127845.3
## Customer_Lifetime_Value ~ Coverage + Vehicle_Class + Monthly_Premium_Auto +
## Total_Claim_Amount
##
## Df Sum of Sq RSS AIC
## - Coverage 2 92543790 2.8727e+11 127844
## - Vehicle_Class 5 393074168 2.8757e+11 127845
## <none> 2.8717e+11 127845
## - Total_Claim_Amount 1 317241437 2.8749e+11 127851
## - Monthly_Premium_Auto 1 1287480143 2.8846e+11 127876
##
## Step: AIC=127843.6
## Customer_Lifetime_Value ~ Vehicle_Class + Monthly_Premium_Auto +
## Total_Claim_Amount
##
## Df Sum of Sq RSS AIC
## - Vehicle_Class 5 3.2126e+08 2.8759e+11 127842
## <none> 2.8727e+11 127844
## + Coverage 2 9.2544e+07 2.8717e+11 127845
## - Total_Claim_Amount 1 3.1446e+08 2.8758e+11 127850
## - Monthly_Premium_Auto 1 1.0322e+10 2.9759e+11 128100
##
## Step: AIC=127841.8
## Customer_Lifetime_Value ~ Monthly_Premium_Auto + Total_Claim_Amount
##
## Df Sum of Sq RSS AIC
## <none> 2.8759e+11 127842
## + Vehicle_Class 5 3.2126e+08 2.8727e+11 127844
## + Coverage 2 2.0727e+07 2.8757e+11 127845
## - Total_Claim_Amount 1 3.4071e+08 2.8793e+11 127848
## - Monthly_Premium_Auto 1 3.5501e+10 3.2309e+11 128691
##
## Call:
## lm(formula = Customer_Lifetime_Value ~ Monthly_Premium_Auto +
## Total_Claim_Amount, data = trainData, na.action = na.omit)
##
## Coefficients:
## (Intercept) Monthly_Premium_Auto Total_Claim_Amount
## 724.9909 82.3935 -0.9586
##
## Call:
## lm(formula = Customer_Lifetime_Value ~ Monthly_Premium_Auto +
## Total_Claim_Amount, data = trainData, na.action = na.omit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12225 -3516 -909 253 56787
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 724.9909 212.4388 3.413 0.000647 ***
## Monthly_Premium_Auto 82.3935 2.7434 30.034 < 2e-16 ***
## Total_Claim_Amount -0.9586 0.3258 -2.942 0.003269 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6274 on 7307 degrees of freedom
## Multiple R-squared: 0.1542, Adjusted R-squared: 0.154
## F-statistic: 666.1 on 2 and 7307 DF, p-value: < 2.2e-16
Only 2 features were left: Monthly_Premium_Auto and Total_Claim_Amount.
#Eliminate outliers that are more than 2.5 standard residuals
sresid <- rstandard(step_lm)
idx <- order(sresid, decreasing=TRUE)
sresid[idx[325]]
resid(step_lm)[idx[1]]
outlier_record <- trainData[idx[1], c('Monthly_Premium_Auto', 'Total_Claim_Amount')]
trainData$sresid <- rstandard(step_lm)
#trainData <- trainData %>%
# filter(sresid < 2.5) %>%
#filter(sresid > -2.5)
Despite numerous records surpassing the 2.5 standard deviation benchmark, we have opted against data exclusion, as a significant proportion adheres to this threshold. Removing these records may potentially compromise the integrity of our model.
## RMSE Rsquared MAE Resample
## 1 5553.824 0.1746101 3614.963 Fold01
## 2 6152.378 0.1148496 3913.138 Fold02
## 3 6235.549 0.1819838 3844.474 Fold03
## 4 6277.868 0.1064881 3985.924 Fold04
## 5 6336.034 0.1626300 3883.500 Fold05
## 6 5957.007 0.1503061 3744.616 Fold06
## 7 6212.078 0.2165131 3821.449 Fold07
## 8 6823.927 0.1706952 3970.905 Fold08
## 9 6372.496 0.1236572 3870.934 Fold09
## 10 6741.269 0.1529866 4062.650 Fold10
## Linear Regression
##
## 7310 samples
## 2 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 6578, 6578, 6580, 6578, 6579, 6580, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 6266.243 0.155472 3871.255
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
##
## Call:
## lm(formula = Customer_Lifetime_Value ~ Monthly_Premium_Auto +
## Total_Claim_Amount, data = trainData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12225 -3516 -909 253 56787
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 724.9909 212.4388 3.413 0.000647 ***
## Monthly_Premium_Auto 82.3935 2.7434 30.034 < 2e-16 ***
## Total_Claim_Amount -0.9586 0.3258 -2.942 0.003269 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6274 on 7307 degrees of freedom
## Multiple R-squared: 0.1542, Adjusted R-squared: 0.154
## F-statistic: 666.1 on 2 and 7307 DF, p-value: < 2.2e-16
## Monthly_Premium_Auto Total_Claim_Amount
## 1.667533 1.667533
## RMSE: 6433.889
## R-squared 0.1738726
The model performs even better on data it haven’t seem before.
Previously, We found that only when number of policies is 2 the relationship between Customer_Lifetime_Value (target value) and Monthly_Premium_Auto (a feature) vary hugely. We suspect that the model could improve it’s accuaracy by filtering to exclude those records.
#Split the data into training and holdout sets using a 80-20 split
Data_filtered <- Data %>%
filter(Number_of_Policies != 2)
set.seed(123)
trainIndex_filtered <- createDataPartition(Data_filtered$Customer_Lifetime_Value, p=0.8, list = FALSE)
trainData_filtered <- Data_filtered[trainIndex_filtered, ]
holdoutData_filtered <- Data_filtered[-trainIndex_filtered, ]
##
## Call:
## lm(formula = Customer_Lifetime_Value ~ Monthly_Premium_Auto +
## Total_Claim_Amount, data = trainData_filtered, na.action = na.omit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5090.8 -1649.4 707.3 1527.4 7428.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 282.3492 74.6233 3.784 0.000156 ***
## Monthly_Premium_Auto 58.5334 0.9699 60.350 < 2e-16 ***
## Total_Claim_Amount -0.7583 0.1143 -6.634 3.58e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1918 on 5469 degrees of freedom
## Multiple R-squared: 0.4961, Adjusted R-squared: 0.4959
## F-statistic: 2692 on 2 and 5469 DF, p-value: < 2.2e-16
## RMSE Rsquared MAE Resample
## 1 1908.713 0.4537858 1782.501 Fold01
## 2 1941.120 0.4918133 1798.385 Fold02
## 3 1890.676 0.4469801 1749.627 Fold03
## 4 1919.904 0.4895175 1781.199 Fold04
## 5 1955.704 0.5143336 1804.137 Fold05
## 6 1945.127 0.5310277 1785.329 Fold06
## 7 1880.966 0.4924325 1745.159 Fold07
## 8 1925.587 0.5012236 1775.582 Fold08
## 9 1957.513 0.5555398 1791.329 Fold09
## 10 1858.595 0.4548237 1732.396 Fold10
## Linear Regression
##
## 5472 samples
## 2 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 4926, 4924, 4925, 4925, 4924, 4926, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 1918.39 0.4931478 1774.564
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
## RMSE: 1921.308
## R-squared 0.5151335
The model performs even better on data it haven’t seem before.
A strong relationship between the coverage (Basic, Extended, or Premium) and CLV exists, with higher CLV for more premium coverages.
Vehicle Class also impacts CLV, with luxury cars having higher CLV.
Monthly Premium Auto, the amount paid monthly for auto insurance, had the strongest correlation to CLV.
An inverse correlation was observed between Total Claim Amount and CLV.
The Number of Policies feature showed that customers with two policies had significantly higher CLV, affecting linearity.
Due to multicollinearity between Coverage, Vehicle Class, and Monthly Premium Auto, only the most promising feature (Monthly Premium Auto) was retained for the model.
Our first model had an RMSE of 6433.889 and an R-squared of 0.1738 on the holdout sample. Although not highly accurate, the model’s coefficients can help make informed business decisions to increase CLV.
The second model performed better with an RMSE of 1921.308 and an R-Squared of 0.5151. The improved performance was due to the identification of an irregular behavior associated with customers having two policies. We recommend investigating this behavior to determine the cause, such as a special offer for purchasing two policies. By understanding and addressing this issue, we can develop an even more accurate model to predict CLV.